Routed: support vxlan networks #10861

Open: weizhouapache wants to merge 1 commit into base branch 4.20

weizhouapache
Member

Description

This PR adds support for Routed networks with VXLAN isolation.

Fixes #10855

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

codecov bot commented May 13, 2025

Codecov Report

Attention: Patch coverage is 25.00000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 16.14%. Comparing base (52d9860) to head (bc0c3db).
Report is 18 commits behind head on 4.20.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../java/com/cloud/network/guru/GuestNetworkGuru.java | 14.28% | 11 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff            @@
##               4.20   #10861   +/-   ##
=========================================
  Coverage     16.14%   16.14%           
  Complexity    13237    13237           
=========================================
  Files          5655     5655           
  Lines        497310   497314    +4     
  Branches      60277    60276    -1     
=========================================
+ Hits          80278    80294   +16     
+ Misses       408088   408075   -13     
- Partials       8944     8945    +1     

| Flag | Coverage Δ |
|---|---|
| uitests | 4.00% <ø> (ø) |
| unittests | 16.99% <25.00%> (+<0.01%) ⬆️ |

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13370

@weizhouapache
Member Author

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13295)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 57243 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10861-t13295-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@weizhouapache
Member Author

@sjanssen15

You can build the project in a container and copy the JAR to your management server by following the steps below:

    # 1. Create a Docker container and open a shell in it

    docker run -d -it --name ubuntu22 ubuntu:22.04
    docker exec -it ubuntu22 bash

    # 2. Clone the repo and change into it (the remaining steps run inside the checkout)

    apt update
    apt install -y git wget
    git clone https://github.com/apache/cloudstack.git -b 4.20.0.0
    cd cloudstack

    # 3. Apply the patch

    wget https://github.com/apache/cloudstack/pull/10861.patch
    git config --global user.email "you@example.com"
    git config --global user.name "Your Name"
    git am 10861.patch
    git log -n2 # the patch commit should be listed first

    # 4. Install build dependencies

    echo "y" | apt build-dep .

    git clone https://github.com/shapeblue/cloudstack-nonoss.git nonoss && cd nonoss && bash -x install-non-oss.sh && cd ..
    rm -fr nonoss

    # 5. Build the project

    mvn -P developer,systemvm -Dnoredist clean install -DskipTests

    # 6. Post-build

    # Copy client/target/cloud-client-ui-4.20.0.0.jar to /usr/share/cloudstack-management/lib/cloudstack-4.20.0.0.jar on the management servers
    # Restart the cloudstack-management service
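
The post-build step above could be scripted roughly as follows. This is only a sketch under assumptions that are not stated in this thread: the management server host name (mgmt1.example.com is hypothetical), root SSH access to it, and the /cloudstack checkout path inside the container (which assumes the clone was run from /). The helper names run and deploy_jar are made up for illustration. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
# Sketch: copy the rebuilt JAR out of the build container and onto a management server.
# Assumptions (not from this thread): host name, root SSH access, /cloudstack path.

JAR_IN_CONTAINER="/cloudstack/client/target/cloud-client-ui-4.20.0.0.jar"
JAR_ON_SERVER="/usr/share/cloudstack-management/lib/cloudstack-4.20.0.0.jar"

# Print each command unless DRY_RUN=0 is set, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# deploy_jar <host>: back up the installed JAR, replace it, and restart the service.
deploy_jar() {
    local host="$1"   # hypothetical management server, e.g. mgmt1.example.com
    run docker cp "ubuntu22:${JAR_IN_CONTAINER}" ./cloud-client-ui-4.20.0.0.jar
    run ssh "root@${host}" cp "${JAR_ON_SERVER}" /root/cloudstack-4.20.0.0.jar.bak
    run scp ./cloud-client-ui-4.20.0.0.jar "root@${host}:${JAR_ON_SERVER}"
    run ssh "root@${host}" systemctl restart cloudstack-management
}

# Usage (run on the Docker host, not inside the container):
#   deploy_jar mgmt1.example.com             # prints the four commands
#   DRY_RUN=0 deploy_jar mgmt1.example.com   # executes them
```

Keeping a copy of the original JAR outside the lib directory makes it easy to roll back if the patched build misbehaves.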

@sjanssen15

Hi, thank you! I just built this version and it does seem to work now when creating a network and allocating a subnet. However, I notice that CPU usage is at 100% and the environment is very slow. Any idea whether I should continue testing or fix this issue first?
image

@weizhouapache
Member Author

@sjanssen15
That should not happen.
Can you check management-server.log for errors or exceptions?
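
For the log check suggested above, something like the following could help. It is a minimal sketch that assumes the default packaged log location, /var/log/cloudstack/management/management-server.log; the function name recent_errors is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: show the most recent ERROR lines and exceptions from a management server log.

# Default path on packaged installs (assumption; adjust if yours differs).
MGMT_LOG="/var/log/cloudstack/management/management-server.log"

# recent_errors [logfile] [count]: print the last <count> lines mentioning ERROR or Exception.
recent_errors() {
    local logfile="${1:-$MGMT_LOG}"
    local count="${2:-20}"
    grep -E 'ERROR|Exception' "$logfile" | tail -n "$count"
}

# Usage:
#   recent_errors                      # last 20 matches from the default log
#   recent_errors /path/to/other.log 5 # last 5 matches from another file
```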

@sjanssen15

It seems to have been caused by 2 old KVM hosts from a previous installation that kept trying to connect to this management server; they were effectively DDoSing the machine. After adding them, the issue disappeared. Tomorrow I will do some more testing of the router and related functionality.

@weizhouapache
Member Author

Good, thanks for testing

If there is no major issue, we can merge this PR into 4.20.1

@weizhouapache weizhouapache marked this pull request as ready for review May 19, 2025 15:06
@sjanssen15

I'm really sorry to bother you, but I'm going from error to error. Can you see why I'm unable to run any instances? The system VMs are running, so KVM is working. I have 3 hosts added to the cluster. Could it be related to the network?
unable-to-allocate.txt

@weizhouapache
Member Author

It looks like the VM template has an empty string ('') as its template tag.
Please update template_tag in vm_template to NULL.
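
The suggested change could be applied along these lines. This is a sketch only: it assumes the default database name and user (both "cloud"), and the helper function names are hypothetical. The functions just print the SQL so it can be reviewed before being piped to mysql; take a database backup first.

```shell
#!/usr/bin/env bash
# Sketch: SQL to find and clear empty-string template tags.
# Table and column names (vm_template, template_tag) come from the comment above.

# Print a query listing templates whose tag is an empty string.
list_empty_tags_sql() {
    printf '%s\n' "SELECT id, name, template_tag FROM vm_template WHERE template_tag = '';"
}

# Print the statement that replaces empty-string tags with NULL.
fix_empty_tags_sql() {
    printf '%s\n' "UPDATE vm_template SET template_tag = NULL WHERE template_tag = '';"
}

# Usage (assumes the default "cloud" database and user; prompts for the password):
#   list_empty_tags_sql | mysql -u cloud -p cloud
#   fix_empty_tags_sql  | mysql -u cloud -p cloud
```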

@sjanssen15

To give you a quick update: I'm still not sure how I fixed it, but I reset some template permissions and tags and it started working. Strange, since this is a clean install.

I now have a VXLAN segment with a virtual router attached to it. I'm able to SSH into the VR and ping the running instance (from a different host as well). However, I'm unable to reach the public network that the VR is attached to. I do have the correct VLAN and type configured. On the KVM host itself I'm able to reach the public gateway via that cloud bridge.
image

@weizhouapache
Member Author

You cannot reach the gateway 10.4.152.1 from the VR, right?

Can you assign a temporary IP (in the 10.4.152.0/24 network) to the Linux bridge (brXXXX-1452) on the KVM host, and check whether .1 and .102 are reachable?

@sjanssen15

sjanssen15 commented May 21, 2025

I tried ip addr add 10.4.152.14/24 dev brens1np1-1452, but it still doesn't work. I'm able to ping 10.4.152.1 from the host because my cloudbr3 has a working configuration. I use the following netplan config (I redacted the other bridges). One possibly important caveat: I had to rename "ens7f1np1" because the bridge names Linux would have to create for it are too long. This is a known limitation of 15 characters (brens7f1np1-1452 = 16 chars). Could this be the issue?

network:
  version: 2
  ethernets:
    eno5np0: {}
    eno6np1: {}
    ens0np0:
      match:
        name: ens7f0np0
      set-name: ens0np0
    ens1np1:
      match:
        name: ens7f1np1
      set-name: ens1np1
  vlans:
    < redacted >
    ens1np1.1452:
      id: 1452
      link: "ens1np1"
  bridges:
    < redacted >
    cloudbr3:
      addresses:
      - "10.4.152.10/24"
      routes:
      - to: "10.4.152.0/24"
        via: "10.4.152.1"
      dhcp4: false
      dhcp6: false
      interfaces: [ ens1np1.1452 ]
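
The name-length limit described above can be checked before renaming anything. A small sketch, assuming the bridge is named br<interface>-<vlan> as in the examples in this comment; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: check whether a br<iface>-<vlan> bridge name fits the Linux 15-character limit.

MAX_IFNAME_LEN=15   # IFNAMSIZ is 16 bytes including the trailing NUL

# bridge_name <iface> <vlan>: print the bridge name built from an interface and a VLAN ID.
bridge_name() {
    printf 'br%s-%s\n' "$1" "$2"
}

# bridge_name_fits <iface> <vlan>: succeed if the resulting name is short enough.
bridge_name_fits() {
    local name
    name="$(bridge_name "$1" "$2")"
    [ "${#name}" -le "$MAX_IFNAME_LEN" ]
}

# Compare the original and the renamed interface from the netplan config above.
for iface in ens7f1np1 ens1np1; do
    name="$(bridge_name "$iface" 1452)"
    if bridge_name_fits "$iface" 1452; then
        echo "$name (${#name} chars): ok"
    else
        echo "$name (${#name} chars): too long"
    fi
done
# prints:
#   brens7f1np1-1452 (16 chars): too long
#   brens1np1-1452 (14 chars): ok
```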

@weizhouapache
Member Author

@sjanssen15
I think the problem is that ens1np1.1452 uses cloudbr3 as its master bridge.
You can try to stop the VRs and rename cloudbr3 to brens1np1-1452
